Recursive chunking

Updated 2026-04-18

Recursive chunking is the pragmatic middle ground: try to split on natural boundaries (paragraphs, then sentences, then spaces), but enforce a size limit. It's simple, fast, and usually the best default. It's the approach LangChain's RecursiveCharacterTextSplitter implements, and it avoids most of the arbitrary mid-sentence cuts that pure fixed-size chunking produces.

The algorithm

Define a list of separators in order of preference, typically ["\n\n", "\n", ". ", " ", ""] (the empty string is the last resort: cut between any two characters)
Try to split the text on the first separator
If any resulting piece is still larger than the target size, recursively split that piece using the next separator
Continue until all pieces fit within the target size
Combine adjacent pieces to approach the target size without exceeding it

The result: chunks that prefer paragraph boundaries, fall back to sentence boundaries, and only split mid-sentence as a last resort.
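The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not LangChain's implementation: the function name `recursive_split` and the `max_size` parameter are made up for this example, and size is measured in characters with `len` to keep it dependency-free.

```python
from typing import List, Sequence

DEFAULT_SEPARATORS = ("\n\n", "\n", ". ", " ", "")


def recursive_split(
    text: str,
    max_size: int,
    separators: Sequence[str] = DEFAULT_SEPARATORS,
) -> List[str]:
    """Split text into chunks of at most max_size characters (hypothetical helper)."""
    if len(text) <= max_size:
        return [text] if text else []

    # No separators left, or the "" separator: hard cut by character count.
    if not separators or separators[0] == "":
        return [text[i:i + max_size] for i in range(0, len(text), max_size)]

    sep, rest = separators[0], separators[1:]
    pieces = text.split(sep)
    # Re-attach the separator to every piece but the last so no text is lost.
    pieces = [p + sep for p in pieces[:-1]] + [pieces[-1]]

    chunks: List[str] = []
    current = ""
    for piece in pieces:
        if not piece:
            continue
        if len(piece) > max_size:
            # Too big even on its own: flush what we have, then recurse
            # into this piece with the next, finer-grained separator.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, max_size, rest))
        elif len(current) + len(piece) <= max_size:
            # Merge adjacent small pieces to approach the target size.
            current += piece
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


text = "Para one is short.\n\nPara two is a bit longer. It has two sentences.\n\nTiny."
for chunk in recursive_split(text, max_size=40):
    print(repr(chunk))
```

The sketch keeps separators attached so the chunks concatenate back to the original text, which makes it easy to check. As a side effect it can emit whitespace-only chunks; a real pipeline would typically strip whitespace and drop empty chunks, and would also add overlap.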

Why it works well as a default

Respects natural text structure without requiring structure-aware parsing
Avoids splitting mid-sentence in most cases
Works on most plain-text prose with the default settings
Fast: no embeddings required during chunking
Deterministic: same input produces same chunks

The separator list matters

Default separator lists are reasonable for generic prose. For specific content types, customize:

Markdown: add heading separators first: ["\n# ", "\n## ", "\n### ", "\n\n", "\n", ". ", " "]
Code: use language-aware separators (see chunking code)
HTML: convert to Markdown first, then use Markdown separators
Legal/technical: add numbered section separators: ["\n\n§", "\n\n(", "\n\n"]
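To see why the order of the list matters, the sketch below compares where the first cut lands on a small Markdown snippet under the generic list and under the Markdown list given above. The names `SEPARATOR_PRESETS` and `first_level_split` are hypothetical, and the function only shows the top-level split, not a full recursive splitter.

```python
from typing import Dict, List

# Separator lists copied from the text; the preset names are illustrative.
SEPARATOR_PRESETS: Dict[str, List[str]] = {
    "prose": ["\n\n", "\n", ". ", " ", ""],
    "markdown": ["\n# ", "\n## ", "\n### ", "\n\n", "\n", ". ", " "],
    "legal": ["\n\n§", "\n\n(", "\n\n"],
}


def first_level_split(text: str, separators: List[str]) -> List[str]:
    """Split on the first separator in the list that occurs in the text."""
    for sep in separators:
        if sep and sep in text:
            parts = text.split(sep)
            # Keep the separator at the start of the following piece so a
            # heading marker stays with the section it introduces.
            return [parts[0]] + [sep + p for p in parts[1:]]
    return [text]


doc = "# Guide\nIntro line.\n\nMore intro.\n## Setup\nInstall it.\n\nRun it."

generic = first_level_split(doc, SEPARATOR_PRESETS["prose"])
markdown = first_level_split(doc, SEPARATOR_PRESETS["markdown"])

print(len(generic), len(markdown))
for piece in markdown:
    print(repr(piece))
```

With the generic list, the first cut falls on blank lines, so the "## Setup" heading ends up in the same piece as the tail of the previous section. With the Markdown list, the first cut falls on the heading, so each section starts a new piece.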

Size tuning

The same sweet spots apply as for fixed-size chunking:

Chunk size: 400-800 tokens for general prose
Chunk overlap: 10-15% of chunk size
Measure both in tokens, not characters
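As a rough illustration of these numbers, the sketch below derives the overlap from the chunk size and applies both in token units. The function names are hypothetical, and `simple_tokens` is only a whitespace stand-in: in practice you would count tokens with the tokenizer that matches your embedding model.

```python
from typing import List


def simple_tokens(text: str) -> List[str]:
    """Stand-in tokenizer (whitespace split); an assumption for this sketch."""
    return text.split()


def token_windows(
    tokens: List[str], chunk_size: int, overlap_ratio: float = 0.1
) -> List[List[str]]:
    """Cut a token list into windows of chunk_size with proportional overlap."""
    overlap = round(chunk_size * overlap_ratio)
    step = chunk_size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than chunk_size")

    windows: List[List[str]] = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + chunk_size])
        # Stop once a window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return windows


tokens = simple_tokens(" ".join(f"w{i}" for i in range(2000)))
windows = token_windows(tokens, chunk_size=600, overlap_ratio=0.12)
print(len(windows), [len(w) for w in windows])
```

In a recursive splitter the same two numbers play the same role: chunk size caps each merged chunk, and overlap decides how many trailing tokens of one chunk are repeated at the start of the next.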

When recursive chunking falls short

Very heterogeneous content (mixed prose, tables, code) where one separator strategy doesn't fit
Long narrative with few paragraph breaks (the recursive splitter falls back to sentence or word splits)
Content where structure should drive chunking (headings, sections) rather than just natural separators

The hybrid pattern

The strongest practical chunking strategy for most corpora:

Parse the document into structural elements (headings, paragraphs, lists, tables)
Apply recursive chunking within each structural element
Respect element boundaries: never merge a heading into a paragraph, and never split a table across chunks

This combines recursive chunking's simplicity with structure-aware chunking's semantic respect. See structure-aware chunking.
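A minimal sketch of the hybrid pattern, under heavy simplifying assumptions: elements are separated by blank lines, a block starting with "#" is a heading, and a block whose lines all start with "|" is a table. All function names are hypothetical, and a real parser (Markdown, HTML, PDF layout) would replace `parse_elements`.

```python
from typing import List, Sequence, Tuple

Element = Tuple[str, str]  # (kind, text)


def parse_elements(doc: str) -> List[Element]:
    """Very rough structural parse: blank-line-separated blocks."""
    elements: List[Element] = []
    for block in doc.split("\n\n"):
        block = block.strip("\n")
        if not block:
            continue
        lines = block.split("\n")
        if block.startswith("#"):
            kind = "heading"
        elif all(line.lstrip().startswith("|") for line in lines):
            kind = "table"
        else:
            kind = "paragraph"
        elements.append((kind, block))
    return elements


def split_paragraph(
    text: str, max_size: int, separators: Sequence[str] = (". ", " ")
) -> List[str]:
    """Recursive split inside one paragraph: sentences, then words."""
    if len(text) <= max_size:
        return [text]
    if not separators:
        return [text[i:i + max_size] for i in range(0, len(text), max_size)]
    sep = separators[0]
    pieces = text.split(sep)
    pieces = [p + sep for p in pieces[:-1]] + [pieces[-1]]
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        if len(piece) > max_size:
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(split_paragraph(piece, max_size, separators[1:]))
        elif len(current) + len(piece) <= max_size:
            current += piece
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


def hybrid_chunks(doc: str, max_size: int) -> List[Element]:
    """Recursive chunking within paragraphs; headings and tables stay whole."""
    chunks: List[Element] = []
    for kind, text in parse_elements(doc):
        if kind == "paragraph":
            chunks.extend((kind, c) for c in split_paragraph(text, max_size))
        else:
            # Never split a table; never merge a heading into a paragraph.
            chunks.append((kind, text))
    return chunks


doc = (
    "# Title\n\n"
    "First sentence here. Second sentence here. Third one.\n\n"
    "| a | b |\n| 1 | 2 |\n| 3 | 4 |"
)
for kind, text in hybrid_chunks(doc, max_size=25):
    print(kind, repr(text))
```

Note that the table is kept whole even though it is longer than `max_size`. That is a deliberate choice in this sketch; an oversized table needs its own strategy (for example, splitting by rows and repeating the header) rather than a character-level cut.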

Common mistakes

Forgetting to configure separators for non-English content (punctuation differs)
Using character-based sizes when the embedding model's limit is measured in tokens (dense text can overflow the limit)
Setting overlap too high (20-30%), which wastes embeddings on duplicated text and fills retrieval results with near-identical chunks
Using defaults on Markdown, which treats headings as regular text and fragments sections
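The first mistake is easy to demonstrate. Japanese and Chinese prose ends sentences with "。" rather than ". ", so the default sentence separator never matches. The separator list and helper name below are illustrative assumptions, not a library default.

```python
from typing import List

# Hypothetical separator list for CJK prose: full-width sentence punctuation
# replaces the ". " entry used for English.
CJK_SEPARATORS = ["\n\n", "\n", "。", "！", "？", " ", ""]


def split_keep(text: str, sep: str) -> List[str]:
    """Split on sep, keeping the separator at the end of each piece."""
    parts = text.split(sep)
    pieces = [p + sep for p in parts[:-1]]
    if parts[-1]:
        pieces.append(parts[-1])
    return pieces


text = "これは最初の文です。これは二番目の文です。"

# The English sentence separator never occurs in this text...
print(". " in text)
# ...while the full-width full stop yields one piece per sentence.
print(split_keep(text, "。"))
```

With the default list, a splitter would skip straight from newlines to the character-level fallback on text like this, which is exactly the mid-sentence cutting that recursive chunking is meant to avoid.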

What to do with this

Use recursive chunking as your default. Customize the separator list for your content type.
Combine with structure-aware parsing for the strongest pragmatic chunker.
If your content is heavily Markdown, override separators to respect headings first.