Structure-aware chunking

5 min read · Updated 2026-04-18

Structure-aware chunking uses the document's own organization as the chunk boundary: headings, sections, lists, tables. For any corpus that has real structure (technical docs, wikis, legal documents, Markdown, HTML), this is the highest-performing chunking strategy. It's also the most underused.

The core idea

Authors have already decided where ideas begin and end; that is what headings and paragraphs are. Respect their decisions.

Chunk boundaries come from:

Heading changes (H1, H2, H3 shifts)
Section breaks
List boundaries
Table boundaries
Code block boundaries

The algorithm

Parse document into typed elements with hierarchy (heading levels, sections, lists, etc.)
Walk the element tree, grouping adjacent elements under the same heading into candidate chunks
If a candidate chunk exceeds target size, split it by sub-element (or by recursive chunking within)
If a candidate chunk is too small, merge it with the next one (but only within the same section)
Never let a chunk span two top-level sections, and never merge across a major heading transition
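
The steps above can be sketched in a few dozen lines. This is a minimal illustration, not a reference implementation: the Element and Chunk types, the character-based size limits, and the reading of "same section" as "same top-level heading" are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    kind: str        # "heading", "paragraph", "list", "table", "code"
    text: str
    level: int = 0   # 1 for H1, 2 for H2, ...; 0 for non-headings


@dataclass
class Chunk:
    heading_path: tuple
    parts: list = field(default_factory=list)

    @property
    def text(self):
        return "\n\n".join(self.parts)


def shared_prefix(a, b):
    """Longest common leading run of two heading paths."""
    n = 0
    while n < min(len(a), len(b)) and a[n] == b[n]:
        n += 1
    return a[:n]


def chunk_elements(elements, max_chars=1200, min_chars=200):
    # Steps 1-2: walk the elements, opening a new candidate at every heading.
    path, candidates = [], []
    for el in elements:
        if el.kind == "heading":
            while path and path[-1][0] >= el.level:
                path.pop()  # leave sibling or deeper headings
            path.append((el.level, el.text))
            candidates.append(Chunk(tuple(t for _, t in path)))
        else:
            if not candidates:
                candidates.append(Chunk(()))  # content before any heading
            candidates[-1].parts.append(el.text)

    # Step 3: split oversized candidates between sub-elements.
    # A single element larger than max_chars is left whole here.
    sized = []
    for cand in candidates:
        current = Chunk(cand.heading_path)
        for part in cand.parts:
            if current.parts and len(current.text) + len(part) > max_chars:
                sized.append(current)
                current = Chunk(cand.heading_path)
            current.parts.append(part)
        if current.parts:
            sized.append(current)

    # Steps 4-5: merge an undersized chunk with the next one, but only
    # when both sit under the same top-level heading (an assumption).
    merged = []
    for chunk in sized:
        prev = merged[-1] if merged else None
        if (prev is not None
                and len(prev.text) < min_chars
                and prev.heading_path[:1] == chunk.heading_path[:1]
                and len(prev.text) + len(chunk.text) <= max_chars):
            prev.parts.extend(chunk.parts)
            prev.heading_path = shared_prefix(prev.heading_path, chunk.heading_path)
        else:
            merged.append(chunk)
    return merged
```

A merged chunk keeps only the heading path its parts share, so a short intro merged with its first subsection is labeled with the parent heading. Counting tokens instead of characters would be a natural change for a real pipeline.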

The heading context trick

Every chunk carries its heading path as metadata or prepended text. For a chunk under "Setup > Authentication > OAuth > Configuration", the chunk text might be prepended with:

# Setup > Authentication > OAuth > Configuration

[actual chunk content...]

The LLM and embedding model both benefit from knowing where a chunk sits in the document hierarchy. Without this, a chunk about "rate limits" could be about API rate limits, OAuth token rate limits, or webhook rate limits, all of which might exist in the same document.
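
A small helper makes the idea concrete. The function name, the " > " separator, the record layout, and the sample sentence are illustrative choices, not a fixed convention; the sketch does both things the text mentions, prepending the path and keeping it as metadata.

```python
def with_heading_context(heading_path, content, separator=" > "):
    """Return a chunk record whose text starts with the heading path."""
    breadcrumb = separator.join(heading_path)
    text = f"# {breadcrumb}\n\n{content}" if breadcrumb else content
    return {"text": text, "metadata": {"heading_path": list(heading_path)}}


# Placeholder content, invented for the example.
chunk = with_heading_context(
    ["Setup", "Authentication", "OAuth", "Configuration"],
    "Rate limits apply per client ID.",
)
print(chunk["text"])
```

Keeping the path in metadata as well as in the text lets you filter or cite by section later without re-parsing the chunk.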

Element-type-specific rules

Headings

Never a chunk by itself. Attach to the following content.

Tables

Never split. Serialize as a single chunk with surrounding context. See tables and figures.

Code blocks

Never split within. A code block is often its own chunk, accompanied by a separate chunk that holds the surrounding context.

Lists

Keep lists together when possible. If a list is huge (e.g., a reference list), chunk by list sections or by logical groups.

Images/figures

Attach to nearest text chunk. Store figure metadata in chunk metadata. See tables and figures.
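
One way to encode these rules is a single pass over the parsed elements. The sketch below assumes elements arrive as hypothetical (kind, text) pairs and simplifies "nearest text chunk" to "the preceding chunk, or the next one if none exists yet"; size handling is left out so the per-type rules stay visible.

```python
ATOMIC_KINDS = {"table", "code"}  # never split these downstream


def apply_element_rules(elements):
    """elements: list of (kind, text) pairs in document order."""
    chunks = []
    pending_heading = None   # a heading waits for the content that follows it
    pending_figures = []     # figures seen before any chunk exists
    for kind, text in elements:
        if kind == "heading":
            pending_heading = text           # never a chunk by itself
        elif kind == "figure":
            if chunks:
                chunks[-1]["figures"].append(text)
            else:
                pending_figures.append(text)
        else:
            body = text if pending_heading is None else f"{pending_heading}\n\n{text}"
            chunks.append({
                "text": body,
                "kind": kind,
                "atomic": kind in ATOMIC_KINDS,
                "figures": pending_figures,
            })
            pending_heading = None
            pending_figures = []
    return chunks


sample = [
    ("heading", "Limits"),
    ("paragraph", "Requests are capped."),
    ("figure", "fig-1.png"),
    ("table", "| plan | cap |"),
    ("code", "print(1)"),
]
result = apply_element_rules(sample)
```

A later sizing step can then consult the atomic flag and refuse to split tables or code blocks even when they exceed the target size. A heading with no content after it is dropped in this sketch.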

The Markdown case

Markdown is the easiest case. Headings are unambiguous, paragraphs are clear, code blocks are explicit. A good Markdown chunker:

Parses into an AST
Walks headings, creating sections
Produces one chunk per section, with heading hierarchy as metadata
Splits large sections on subheadings or natural paragraph boundaries
Preserves code blocks, tables, and lists as atomic units
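
As a rough illustration, the section-walking part can be approximated with the standard library alone. Note the swap: this is a line scanner, not a real AST parser, and it only understands ATX headings ("#" style) and simple fenced code blocks. The function name and the output layout are assumptions for the example.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")
FENCE = re.compile(r"^\s{0,3}(```|~~~)")


def chunk_markdown(markdown):
    """Return one record per section, each with its heading hierarchy."""
    path = []       # stack of (level, title) for the current position
    sections = []
    body = []
    in_fence = False

    def flush():
        text = "\n".join(body).strip()
        if text:
            sections.append({"heading_path": [t for _, t in path], "text": text})
        body.clear()

    for line in markdown.splitlines():
        if FENCE.match(line):
            in_fence = not in_fence          # toggles on opening and closing fences
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()                          # close the previous section first
            level = len(match.group(1))
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)                # fenced lines are kept verbatim
    flush()
    return sections
```

Because lines inside a fence are never tested against the heading pattern, a shell comment such as "# install" inside a code block does not start a new section. A production chunker would rely on a proper Markdown parser for setext headings, nested fences, and tables.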

The HTML case

HTML is Markdown's messy cousin. Parse into DOM, identify content containers, extract hierarchical sections from heading tags. Convert to Markdown for downstream processing, then chunk as Markdown.
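
The conversion step might look like the following sketch, which uses only Python's built-in html.parser. The class name, the list of skipped tags, and the list of block tags are assumptions; real pages usually need site-specific rules for finding the content container.

```python
from html.parser import HTMLParser


class MarkdownExtractor(HTMLParser):
    HEADINGS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}
    SKIPPED = {"script", "style", "nav", "footer"}    # assumed non-content tags
    BLOCKS = {"p", "li", "div", "section", "article", "tr"}

    def __init__(self):
        super().__init__()
        self.blocks = []    # finished Markdown blocks
        self._buf = []      # text of the block being read
        self._prefix = ""   # "## " while inside a heading
        self._skip = 0      # depth inside skipped tags

    def _flush(self):
        text = " ".join("".join(self._buf).split())  # collapse whitespace
        if text:
            self.blocks.append(self._prefix + text)
        self._buf = []
        self._prefix = ""

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip += 1
        elif tag in self.HEADINGS:
            self._flush()
            self._prefix = "#" * self.HEADINGS[tag] + " "
        elif tag in self.BLOCKS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in self.SKIPPED:
            self._skip = max(0, self._skip - 1)
        elif tag in self.HEADINGS or tag in self.BLOCKS:
            self._flush()

    def handle_data(self, data):
        if not self._skip:
            self._buf.append(data)


def html_to_markdown(html):
    parser = MarkdownExtractor()
    parser.feed(html)
    parser.close()
    parser._flush()   # emit any trailing text
    return "\n\n".join(parser.blocks)
```

The output keeps only headings and block text, which is enough for heading-based chunking. Tables, lists, and links are flattened here and would need their own handling.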

The PDF case

PDFs don't have guaranteed structure, but good parsers (Docling, Unstructured, LlamaParse) return typed elements with heading levels inferred from font size, style, and position. The quality of structure-aware chunking depends heavily on parser quality; see parsing PDFs.
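
To give a feel for the font-size part of that inference, here is a toy heuristic. It is not how any of the named parsers work internally; it assumes a hypothetical input of (text, font_size) spans, treats the size carrying the most characters as body text, and ranks every larger size as a heading level.

```python
from collections import Counter


def infer_heading_levels(spans):
    """spans: list of (text, font_size) pairs in reading order.

    Returns (text, level) pairs, where level 0 means "not a heading".
    """
    if not spans:
        return []
    # Body size: the font size that covers the most characters.
    weight = Counter()
    for text, size in spans:
        weight[size] += len(text)
    body_size = weight.most_common(1)[0][0]
    # Every distinct size above body text becomes a heading level,
    # largest first.
    heading_sizes = sorted({size for _, size in spans if size > body_size}, reverse=True)
    level_of = {size: rank + 1 for rank, size in enumerate(heading_sizes)}
    return [(text, level_of.get(size, 0)) for text, size in spans]


spans = [
    ("Guide", 24.0),
    ("Setup", 18.0),
    ("Install the package with your usual tool.", 11.0),
    ("OAuth", 14.0),
    ("Configure the client id and secret first.", 11.0),
    ("page 3", 8.0),
]
levels = infer_heading_levels(spans)
```

Font size alone misfires on pull quotes, captions, and documents that mark headings with bold text only, which is part of why parser quality matters so much.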

Why it outperforms fixed-size

Chunks align with semantic units (one section = one chunk)
Heading context disambiguates ambiguous terms
Retrieved chunks are self-contained and LLM-friendly
Citations are natural ("from Section 3.2")
User experience improves because returned chunks look like explanations, not fragments

Implementation

Unstructured.io returns typed elements suitable for structure-aware chunking out of the box
LlamaIndex MarkdownNodeParser / HTMLNodeParser handle structure
Custom parsers for domain-specific document types (legal contracts, medical records) often pay back quickly

When it's overkill

Plain text without structure (chat logs, emails without formatting)
Very short documents where one chunk covers everything
Homogeneous corpora (all blog posts of similar length and shape)

For everything else (documentation, knowledge bases, reports, books, technical manuals), structure-aware chunking is worth the extra engineering.

What to do with this

If your content has real structure, use it. Don't throw it away.
Always prepend the heading path as context to each chunk.
Combine with recursive chunking within big sections to stay within size limits.
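
For the last point, a recursive splitter can be as small as the sketch below. The separator order (paragraphs, lines, sentences, words) and the character-based limit are assumptions; it would run on the body of a section that is still too large after structural splitting, before the heading path is prepended.

```python
def recursive_split(text, max_chars, separators=("\n\n", "\n", ". ", " ")):
    """Split text into pieces of at most max_chars, preferring coarse breaks."""
    if len(text) <= max_chars:
        return [text]
    if not separators:
        # Nothing left to split on: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:
        return recursive_split(text, max_chars, finer)

    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_chars:
            current = candidate              # keep packing pieces together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            # The piece alone is too big: retry with finer separators.
            chunks.extend(recursive_split(piece, max_chars, finer))
    if current:
        chunks.append(current)
    return chunks


section = "A" * 50 + "\n\n" + "B" * 50 + "\n\n" + "C" * 130
parts = recursive_split(section, max_chars=120)
```

Separators that fall on a chunk boundary are dropped in this version. Atomic elements such as tables and code blocks should bypass this function entirely.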