Recursive chunking
Updated 2026-04-18
Recursive chunking is the pragmatic middle ground: try to split on natural boundaries (paragraphs, then sentences, then spaces), but enforce a size limit. It's simple, fast, and usually the best default. It's the approach behind LangChain's RecursiveCharacterTextSplitter, and it's a big improvement over pure fixed-size chunking, which cuts wherever the size limit happens to land.
The algorithm
- Define a list of separators in order of preference: typically ["\n\n", "\n", ". ", " ", ""], where the final empty string means "split between any two characters"
- Try to split the text on the first separator
- If any resulting piece is still larger than the target size, recursively split that piece using the next separator
- Continue until all pieces fit within the target size
- Combine adjacent pieces to approach the target size without exceeding it
The result: chunks that prefer paragraph boundaries, fall back to sentence boundaries, and only split mid-sentence as a last resort.
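The steps above can be sketched in a few lines of Python. This is a minimal illustration, not LangChain's implementation: the function name `recursive_split` is made up for this example, sizes are measured in characters to keep it dependency-free, overlap is left out, and each separator stays attached to the piece before it so that joining the chunks reproduces the input.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", ". ", " ", "")):
    """Split text into chunks of at most chunk_size characters (sketch)."""
    if len(text) <= chunk_size:
        return [text] if text else []

    # No separators left, or the "" separator: hard cut at fixed width.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    if sep not in text:
        # This separator does not occur; try the next preference.
        return recursive_split(text, chunk_size, rest)

    # Split on the separator, keeping it at the end of the preceding piece.
    parts = text.split(sep)
    pieces = [p + sep for p in parts[:-1]] + [parts[-1]]
    pieces = [p for p in pieces if p]

    # Recursively split any piece that is still too large.
    small = []
    for piece in pieces:
        if len(piece) > chunk_size:
            small.extend(recursive_split(piece, chunk_size, rest))
        else:
            small.append(piece)

    # Greedily combine adjacent pieces without exceeding chunk_size.
    chunks, current = [], ""
    for piece in small:
        if current and len(current) + len(piece) > chunk_size:
            chunks.append(current)
            current = piece
        else:
            current += piece
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph. It has two sentences.\n\n"
    "Second paragraph is a bit longer and needs splitting. Done.\n\n"
    "Third."
)
for chunk in recursive_split(sample, chunk_size=40):
    print(repr(chunk))
```

To size chunks in tokens instead, the `len(...)` calls would be replaced with a token counter for the target embedding model.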
Why it works well as a default
- Respects natural text structure without requiring structure-aware parsing
- Avoids splitting mid-sentence in most cases
- Works on any plaintext without configuration
- Fast: no embeddings required during chunking
- Deterministic: same input produces same chunks
The separator list matters
Default separator lists are reasonable for generic prose. For specific content types, customize:
- Markdown: add heading separators first: ["\n# ", "\n## ", "\n### ", "\n\n", "\n", ". ", " "]
- Code: use language-aware separators (see chunking code)
- HTML: convert to Markdown first, then use Markdown separators
- Legal/technical: add numbered section separators: ["\n\n§", "\n\n(", "\n\n"]
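As a rough sketch of how per-content separator lists might be organised, the snippet below keeps one list per content type and shows the first-level split on a small Markdown sample. The names `SEPARATORS`, `first_separator`, and `split_before` are hypothetical, and the generic fallbacks appended after the legal separators are an assumption. Heading separators are kept at the start of the piece they introduce, so a heading stays with its section.

```python
# Separator lists per content type, most preferred first (illustrative values).
SEPARATORS = {
    "prose": ["\n\n", "\n", ". ", " ", ""],
    "markdown": ["\n# ", "\n## ", "\n### ", "\n\n", "\n", ". ", " "],
    # Assumption: section markers first, then the generic prose fallbacks.
    "legal": ["\n\n§", "\n\n(", "\n\n", "\n", ". ", " ", ""],
}


def first_separator(text, separators):
    """Return the most preferred separator that actually occurs in text."""
    for sep in separators:
        if sep in text:
            return sep
    return None


def split_before(text, sep):
    """Split on sep, keeping sep at the start of the piece it introduces."""
    parts = text.split(sep)
    pieces = [parts[0]] + [sep + p for p in parts[1:]]
    return [p for p in pieces if p]


doc = "Intro text.\n## Setup\nInstall it.\n## Usage\nRun it."
sep = first_separator(doc, SEPARATORS["markdown"])
print(repr(sep))
print(split_before(doc, sep))
```

With the Markdown list, the sample splits at its second-level headings before any paragraph or sentence separator is tried.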
Size tuning
The same sweet spots apply as for fixed-size chunking:
- Chunk size: 400-800 tokens for general prose
- Chunk overlap: 10-15% of chunk size
- Use tokens, not characters
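The arithmetic behind these numbers is simple enough to show directly. In the sketch below, `count_tokens` is a hypothetical stand-in that counts words and punctuation marks; a real pipeline would call the embedding model's own tokenizer instead. `overlap_tokens` is likewise an illustrative helper, not a library function.

```python
import re


def count_tokens(text):
    """Stand-in token counter: one token per word or punctuation mark.

    Assumption for illustration only; real tokenizers split differently.
    """
    return len(re.findall(r"\w+|[^\w\s]", text))


def overlap_tokens(chunk_size, fraction=0.125):
    """Overlap in tokens for a given chunk size and overlap fraction."""
    if not 0 <= fraction < 1:
        raise ValueError("fraction must be in [0, 1)")
    return round(chunk_size * fraction)


text = "Hello, world!"
print(len(text), "characters,", count_tokens(text), "tokens")

for size in (400, 512, 800):
    print(size, "token chunks ->", overlap_tokens(size), "tokens of overlap")
```

The gap between the character count and the token count is why a character-based limit cannot be relied on to respect a token-based model limit.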
When recursive falls short
- Very heterogeneous content (mixed prose, tables, code) where one separator strategy doesn't fit
- Long narrative with few paragraph breaks (the recursive splitter falls back to sentence or word splits)
- Content where structure should drive chunking (headings, sections) rather than just natural separators
The hybrid pattern
The strongest practical chunking strategy for most corpora:
- Parse the document into structural elements (headings, paragraphs, lists, tables)
- Apply recursive chunking within each structural element
- Respect element boundaries: never merge a heading into a paragraph, and never split a table across chunks
This combines recursive chunking's simplicity with structure-aware chunking's semantic respect. See structure-aware chunking.
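A minimal sketch of the hybrid pattern, assuming some upstream parser has already produced a list of `(kind, text)` elements. That element format, the `ATOMIC` set, and the function names are all hypothetical. Headings and tables are emitted whole, and everything else goes through a compact recursive splitter sized in characters.

```python
ATOMIC = {"heading", "table"}  # element kinds that are never split or merged


def split_text(text, chunk_size, separators=("\n\n", "\n", ". ", " ")):
    """Compact recursive splitter used inside a single element (sketch)."""
    if len(text) <= chunk_size:
        return [text]
    if not separators:
        # Last resort: hard cut at fixed width.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, rest = separators[0], separators[1:]
    parts = text.split(sep)
    pieces = [p + sep for p in parts[:-1]] + [parts[-1]]

    chunks, current = [], ""
    for piece in pieces:
        # Oversized pieces are split further with the remaining separators.
        for part in split_text(piece, chunk_size, rest):
            if current and len(current) + len(part) > chunk_size:
                chunks.append(current)
                current = ""
            current += part
    if current:
        chunks.append(current)
    return chunks


def hybrid_chunks(elements, chunk_size):
    """Chunk within each element; never merge across element boundaries."""
    chunks = []
    for kind, text in elements:
        if kind in ATOMIC:
            chunks.append((kind, text))  # kept whole, even if oversized
        else:
            chunks.extend((kind, c) for c in split_text(text, chunk_size))
    return chunks


elements = [
    ("heading", "# Install"),
    ("paragraph", "Download the package. Run the installer. Restart the shell."),
    ("table", "| os | cmd |\n| linux | apt |\n| mac | brew |"),
]
for kind, chunk in hybrid_chunks(elements, chunk_size=30):
    print(kind, repr(chunk))
```

Keeping an oversized table whole is a deliberate trade-off in this sketch; a production pipeline would need its own policy for tables that exceed the embedding model's limit.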
Common mistakes
- Forgetting to configure separators for non-English content (punctuation differs: Chinese and Japanese, for example, end sentences with 。 rather than ". ")
- Measuring chunk size in characters when the embedding model's limit is in tokens (causes overflow on dense text)
- Setting overlap too high (20-30%), which wastes embeddings and dilutes retrieval diversity
- Using defaults on Markdown, which treats headings as regular text and fragments sections
Next: Structure-aware chunking.