Structure-aware chunking uses the document's own organization as the chunk boundary: headings, sections, lists, tables. For any corpus that has real structure (technical docs, wikis, legal documents, Markdown, HTML), this is the highest-performing chunking strategy. It's also the most underused.
Authors already decided where ideas begin and end; that's what headings and paragraphs are for. Respect their decisions.
Chunk boundaries come from:
Every chunk carries its heading path as metadata or prepended text. For a chunk under "Setup > Authentication > OAuth > Configuration", the chunk text might be prepended with:
# Setup > Authentication > OAuth > Configuration
[actual chunk content...]
The LLM and embedding model both benefit from knowing where a chunk sits in the document hierarchy. Without this, a chunk about "rate limits" could be about API rate limits, OAuth token rate limits, or webhook rate limits, all of which might exist in the same document.
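A minimal sketch of the prepending step, assuming chunks are held in a small `Chunk` dataclass. The class, the `with_heading_context` helper, and the sample chunk text are illustrative names and placeholders, not part of any particular library:

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """Hypothetical chunk container: heading path plus body text."""
    heading_path: list[str]
    text: str
    metadata: dict = field(default_factory=dict)


def with_heading_context(chunk: Chunk, separator: str = " > ") -> str:
    """Return the chunk text with its heading path prepended as one line."""
    if not chunk.heading_path:
        # Nothing to prepend for content that sits above the first heading.
        return chunk.text
    breadcrumb = separator.join(chunk.heading_path)
    return f"# {breadcrumb}\n{chunk.text}"


# Placeholder content, only to show the shape of the output.
chunk = Chunk(
    heading_path=["Setup", "Authentication", "OAuth", "Configuration"],
    text="[actual chunk content...]",
)
embed_text = with_heading_context(chunk)
print(embed_text)
```

Keeping the path in `heading_path` as well as in the embedded text lets the same chunk serve both uses: the string goes to the embedding model, the list stays available for filtering or display.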
Headings: never a chunk by themselves. Attach each heading to the content that follows it.
Tables: never split. Serialize each table as a single chunk with its surrounding context. See tables and figures.
Code blocks: never split within a block. A code block is often a chunk by itself, plus a surrounding context chunk.
Lists: keep lists together when possible. If a list is huge (e.g., a reference list), chunk by list sections or by logical groups.
Figures: attach to the nearest text chunk. Store figure metadata in chunk metadata. See tables and figures.
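The element rules above can be sketched roughly as follows, reading them as applying to headings, tables, code blocks, lists, and figures respectively. The `Element` and `Chunk` types and the `kind` labels are assumptions for illustration. For brevity the sketch emits one chunk per content element and does not merge adjacent paragraphs or enforce a size limit:

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    kind: str  # assumed labels: heading, paragraph, list, table, code, figure
    text: str


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


ATOMIC = {"table", "code"}  # element kinds that must never be split


def apply_element_rules(elements: list[Element]) -> list[Chunk]:
    chunks: list[Chunk] = []
    pending_headings: list[str] = []  # headings waiting for content
    pending_figures: list[str] = []   # figures seen before any text chunk

    for el in elements:
        if el.kind == "heading":
            # A heading is never emitted alone; hold it for the next element.
            pending_headings.append(el.text)
        elif el.kind == "figure":
            # Attach to the nearest (previous) chunk, or wait for the first one.
            if chunks:
                chunks[-1].metadata.setdefault("figures", []).append(el.text)
            else:
                pending_figures.append(el.text)
        else:
            text = "\n".join(pending_headings + [el.text])
            meta = {"kind": el.kind, "atomic": el.kind in ATOMIC}
            if pending_figures:
                meta["figures"] = pending_figures
                pending_figures = []
            chunks.append(Chunk(text, meta))
            pending_headings = []

    if pending_headings:
        # Trailing headings with no content: fold into the last chunk if any.
        tail = "\n".join(pending_headings)
        if chunks:
            chunks[-1].text += "\n" + tail
        else:
            chunks.append(Chunk(tail, {"kind": "heading", "atomic": False}))
    return chunks


elements = [
    Element("heading", "OAuth"),
    Element("paragraph", "Intro."),
    Element("figure", "fig1.png"),
    Element("heading", "Limits"),
    Element("table", "| a | b |"),
]
chunks = apply_element_rules(elements)
```

The `atomic` flag is there so that a later size-based splitting pass, if one exists, knows which chunks to leave whole.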
Markdown is the easiest case. Headings are unambiguous, paragraphs are clear, code blocks are explicit. A good Markdown chunker:
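One possible shape for such a chunker, as a minimal sketch rather than a complete implementation: it splits on ATX headings (`#` through `######`), tracks the heading path, and avoids treating `#` lines inside fenced code blocks as headings. The function name `chunk_markdown` and the output dictionary keys are assumptions. Setext headings, closing `#` sequences, nested fences, and size limits are deliberately ignored:

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")
FENCE = re.compile(r"^\s*(```|~~~)")


def chunk_markdown(md: str) -> list[dict]:
    chunks: list[dict] = []
    stack: list[tuple[int, str]] = []  # (level, title) of open headings
    body: list[str] = []
    in_fence = False

    def flush() -> None:
        # Emit the accumulated body under the current heading path.
        text = "\n".join(body).strip()
        if text:
            chunks.append({"heading_path": [t for _, t in stack], "text": text})
        body.clear()

    for line in md.splitlines():
        if FENCE.match(line):
            # Naive toggle: any fence line opens or closes a code block.
            in_fence = not in_fence
            body.append(line)
            continue
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Close headings at the same or a deeper level.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks


sample = "\n".join([
    "# Setup",
    "Intro",
    "## Auth",
    "```",
    "# not a heading",
    "```",
    "## Limits",
    "Text",
    "# Other",
    "End",
])
chunks = chunk_markdown(sample)
```

A heading with no body of its own produces no chunk here, but it still appears in the `heading_path` of the sections nested beneath it.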
HTML is Markdown's messy cousin. Parse into DOM, identify content containers, extract hierarchical sections from heading tags. Convert to Markdown for downstream processing, then chunk as Markdown.
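A rough sketch of the HTML-to-Markdown step using only the standard library's `html.parser`. It is a streaming simplification, not a real DOM pass: it skips a fixed, assumed set of non-content tags, turns `h1` to `h6` into ATX headings, and emits other block text as plain paragraphs. The class name, the tag sets, and `html_to_markdown` are illustrative choices; a production pipeline would more likely use a full HTML parser and a dedicated converter:

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
BLOCKS = {"p", "li", "pre", "td", "th", "blockquote"}
SKIP = {"script", "style", "nav", "footer", "aside"}  # assumed non-content


class HeadingsToMarkdown(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.lines: list[str] = []
        self.buf: list[str] = []
        self.skip_depth = 0
        self.heading_level = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP:
            self.skip_depth += 1
        elif tag in HEADINGS or tag in BLOCKS:
            self.flush()
            self.heading_level = int(tag[1]) if tag in HEADINGS else 0

    def handle_endtag(self, tag):
        if tag in SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in HEADINGS or tag in BLOCKS:
            self.flush()

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.buf.append(data)

    def flush(self) -> None:
        # Collapse whitespace and emit the buffered text as one block.
        text = " ".join("".join(self.buf).split())
        self.buf = []
        if text:
            prefix = "#" * self.heading_level + " " if self.heading_level else ""
            self.lines.append(prefix + text)
        self.heading_level = 0


def html_to_markdown(html: str) -> str:
    parser = HeadingsToMarkdown()
    parser.feed(html)
    parser.close()
    parser.flush()
    return "\n\n".join(parser.lines)


page = (
    "<nav>Home</nav><h1>Setup</h1><p>Intro <b>text</b>.</p>"
    "<h2>OAuth</h2><p>Configure it.</p><script>var x;</script>"
)
markdown = html_to_markdown(page)
```

The resulting string keeps the heading hierarchy, so it can be chunked like any other Markdown document.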
PDFs don't have guaranteed structure, but good parsers (Docling, Unstructured, LlamaParse) return typed elements with heading levels inferred from font size, style, and position. The quality of structure-aware chunking depends heavily on parser quality; see parsing PDFs.
For everything else (documentation, knowledge bases, reports, books, technical manuals), structure-aware chunking is worth the extra engineering.